What Happens When I Press Enter? -- EDB Failover Manager Switchover

January 23, 2023

EnterpriseDB Failover Manager (EFM) is a great tool for automating failover and switchover if you use Postgres' streaming replication feature.  Not only do you get High Availability (HA), you get it with just a few simple commands, and it all happens very quickly.  We recently had a case in which a customer sought to improve the wall-clock time of a master/standby switchover in EFM.  In the process, we started discussing what it takes for EFM to actually perform the switchover which, in some scenarios, takes as little as 5 seconds.  We thought we'd share the knowledge with everyone (note that unless otherwise specified, "Agent" is the agent on the node where the command was run):

  1. The 'efm' command process (the CLI) first checks that a master exists and that it and all standbys are in sync. You can promote a standby at any time, even if there's no master or if things are out of sync, but a switchover will not happen unless there's a master and everything is in sync. The CLI then signals the local agent to start the promotion/switchover.
  2. Agent retrieves recovery.conf text from a standby.
  3. Agent sends text to the original master. This is a signal to the master that it should become a standby after the normal manual promotion steps occur (steps 3-10 below).
  4. Master agent drops the VIP and writes out the recovery.conf file (note that host address is incorrect at this point).
  5. Master agent stops monitoring its local database, stops the database, and becomes IDLE.
  6. A standby is chosen (replay paused, standby chosen based on xlog location and priority) for promotion and enters PROMOTING state.
  7. Standby agent runs fencing script, if it exists.
  8. Standby agent writes trigger file and resumes replay.
  9. Thread started in #6 above monitors database for it to come out of recovery and become master. This is where the recovery.check.period property is used.
  10. Other standby agents reconfigure recovery.conf files to point to new master.
  11. Other standby agents stop monitoring the local database, restart the database, and resume monitoring.
  12. Original master agent reconfigures recovery.conf file to point to new master.
  13. Original master agent starts database and resumes monitoring.
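
To make step 6 more concrete, here is a minimal Python sketch of picking a standby by xlog location and priority. It is not EFM's actual code: the `Standby` class, the `choose_standby` function, and the rule "most advanced xlog location first, then lowest priority number" are all assumptions made for illustration. The only Postgres-specific detail is the textual xlog/LSN format, two hexadecimal numbers separated by a slash.

```python
from dataclasses import dataclass
from typing import List


def parse_lsn(lsn: str) -> int:
    """Convert a Postgres xlog location such as '0/3000060' to an integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class Standby:
    # Hypothetical record; field names are illustrative, not EFM's.
    host: str
    xlog_location: str  # replay location reported by the standby
    priority: int       # assumption: a lower number means "promote me first"


def choose_standby(standbys: List[Standby]) -> Standby:
    """Pick the standby with the most advanced xlog location.

    Ties are broken by priority. This ordering is an assumption; the
    article only says both values are taken into account.
    """
    if not standbys:
        raise ValueError("no standby available for promotion")
    return min(
        standbys,
        key=lambda s: (-parse_lsn(s.xlog_location), s.priority),
    )


candidates = [
    Standby("standby-a", "0/3000060", priority=2),
    Standby("standby-b", "0/3000060", priority=1),
    Standby("standby-c", "0/2FFFFFF", priority=1),
]
chosen = choose_standby(candidates)
print(chosen.host)
```

In this example two standbys have replayed to the same location, so the priority value decides between them; the third one is behind and is never considered ahead of the others.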


That's it! Note that the new master is promoted before the old one is restarted as a standby, so you actually have a working master again before the switchover as a whole has finished.
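
The moment the new master becomes usable is the moment the recovery check in step 9 succeeds. A minimal Python sketch of that kind of polling loop follows. It is an illustration only, not EFM's implementation: `wait_until_promoted` and its parameters are hypothetical names, `check_period` stands in for the recovery.check.period property (assumed here to be in seconds), and the `in_recovery` callable is a placeholder for whatever actually asks the database, for example by running `SELECT pg_is_in_recovery()`.

```python
import time
from typing import Callable


def wait_until_promoted(
    in_recovery: Callable[[], bool],
    check_period: float,
    timeout: float,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until the database reports it is no longer in recovery.

    Returns the elapsed time in seconds. Raises TimeoutError if the
    database is still in recovery once `timeout` seconds have passed.
    All names here are hypothetical, chosen for this sketch.
    """
    start = time.monotonic()
    while in_recovery():
        if time.monotonic() - start >= timeout:
            raise TimeoutError("database is still in recovery")
        sleep(check_period)  # wait one check period before asking again
    return time.monotonic() - start


# Simulated database: in recovery for the first two checks, then promoted.
answers = iter([True, True, False])
waits = []
elapsed = wait_until_promoted(
    in_recovery=lambda: next(answers),
    check_period=0.0,
    timeout=5.0,
    sleep=waits.append,  # record the waits instead of really sleeping
)
print(len(waits))
```

With a loop like this, the check period bounds how long the agent can take to notice that promotion has finished, which is one reason the property matters when you are trying to trim seconds from a switchover.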

 

Enjoy!
